ABSTRACT

A growing set of on-line applications is generating data that can
be viewed as very large collections of small, dense social graphs;
these range from sets of social groups, events, or collaboration
projects to the vast collection of graph neighborhoods in large
social networks. A natural question is how to usefully define a
domain-independent 'coordinate system' for such a collection of
graphs, so that the set of possible structures can be compactly
represented and understood within a common space. In this work, we
draw on the theory of graph homomorphisms to formulate and analyze
such a representation, based on computing the frequencies
of small induced subgraphs within each graph. We find that the
space of subgraph frequencies is governed both by its combinatorial
properties, based on extremal results that constrain all graphs,
and by its empirical properties, manifested in the way
that real social graphs appear to lie near a simple one-dimensional
curve through this space.

We develop flexible frameworks for studying each of these as- 
pects. For capturing empirical properties, we characterize a simple 
stochastic generative model, a single-parameter extension of Erdős-
Rényi random graphs, whose stationary distribution over subgraphs
closely tracks the one-dimensional concentration of the real so- 
cial graph families. For the extremal properties, we develop a 
tractable linear program for bounding the feasible space of sub- 
graph frequencies by harnessing a toolkit of known extremal graph 
theory. Together, these two complementary frameworks shed light 
on a fundamental question pertaining to social graphs: what prop- 
erties of social graphs are 'social' properties and what properties 
are 'graph' properties? 

We conclude with a brief demonstration of how the coordinate 
system we examine can also be used to perform classification tasks, 
distinguishing between structures arising from different types of 
social graphs. 

Categories and Subject Descriptors: H.2.8 [Database Manage- 
ment]: Database applications — Data mining 
Keywords: Social Networks, Triadic Closure, Induced Subgraphs, 
Subgraph Census, Graph Homomorphisms. 

Copyright is held by the International World Wide Web Conference
Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink
to the author's site if the Material is used in electronic media.
WWW 2013, May 13-17, 2013, Rio de Janeiro, Brazil.
ACM 978-1-4503-2035-1/13/05.

1. INTRODUCTION

The standard approach to modeling a large on-line social network
is to treat it as a single graph with an enormous number of
nodes and a sparse pattern of connections. Increasingly, however,
many of the key problems encountered in managing an on-line social
network involve working with large collections of small, dense
graphs contained within the network.

On Facebook, for example, the set of people belonging to a group 
or attending an event determines such a graph, and considering the 
set of all groups or all events leads to a very large number of such 
graphs. On any social network, the network neighborhood of each 
individual — consisting of his or her friends and the links among 
them — is also generally a small dense graph with a rich structure, 
on a few hundred nodes or fewer [29]. If we consider the neighborhood
of each user as defining a distinct graph, we again obtain an
enormous collection of graphs. Indeed, this view of a large underly- 
ing social network in terms of its overlapping node neighborhoods 
suggests a potentially valuable perspective on the analysis of the 
network: rather than thinking of Facebook, for example, as a single 
billion-node network, with a global structure that quickly becomes 
incomprehensible, we argue that it can be useful to think of it as 
the superposition of a billion small dense graphs — the network 
neighborhoods, one centered at each user, and each accessible to a 
closer and more tractable investigation. 

Nor is this view limited to a site such as Facebook; one can find 
collections of small dense graphs in the interactions within a set of 
discussion forums [9], within a set of collaborative on-line projects
[31], and in a range of other settings. 

Our focus in the present work is on a fundamental global ques- 
tion about these types of graph collections: given a large set of 
small dense graphs, can we study this set by defining a meaningful 
'coordinate system' on it, so that the graphs it contains can be repre- 
sented and understood within a common space? With such a coor- 
dinate system providing a general-purpose framework for analysis, 
additional questions become possible. For example, when consid- 
ering collections of a billion or more social graphs, it may seem as 
though almost any graph is possible; is that the case, or are there 
underlying properties guiding the observed structures? And how 
do these properties relate to more fundamental combinatorial con- 
straints deriving from the extremal limits that govern all graphs? 
As a further example, we can ask how different graph collections 
compare to one another; do network neighborhoods differ in some 
systematic way, for instance, from social graphs induced by other 
contexts, such as the graphs implicit in social groups, organized 
events, or other arrangements? 

The Present Work. In this paper we develop and analyze such a 
representation, drawing on the theory of graph homomorphisms. 
Roughly speaking, the coordinate system we examine begins by 
describing a graph by the frequencies with which all possible small 
subgraphs occur within it. More precisely, we choose a small number
k (e.g. k = 3 or 4); then, for each graph G in a collection, we
create a vector with a coordinate for each distinct k-node subgraph
H, specifying the fraction of k-tuples of nodes in G that induce a
copy of H (in other words, the frequency of H as an induced subgraph
of G). For k = 3, this description corresponds to what is
sometimes referred to as the triad census [6, 7, 8, 32]. The literature
on frequent subgraph mining [11, 13, 33] and motif counting
[18] is also closely related, but focuses on connected subgraphs.
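
As an illustration of the k = 3 case, the following is a minimal sketch of a triad census by exhaustive enumeration. The function name `triad_frequencies` and the edge-list input format are assumptions made for this example, not part of the original work; for three nodes, the number of induced edges determines the subgraph type, so counting edges is sufficient.

```python
from itertools import combinations

def triad_frequencies(nodes, edges):
    """Fraction of 3-node subsets inducing 0, 1, 2, and 3 edges.

    Hypothetical helper: for k = 3 the induced edge count identifies
    the subgraph (empty, single edge, two-edge path, triangle).
    """
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(nodes, 3):
        m = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        counts[m] += 1
    total = sum(counts)
    return [c / total for c in counts]

# A 4-cycle: every 3-node subset induces a two-edge path.
print(triad_frequencies(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)]))
# prints [0.0, 0.0, 1.0, 0.0]
```

Exhaustive enumeration costs on the order of n^3 operations per graph, which is manageable for graphs with a few hundred nodes.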

With each graph in the collection mapped to such a vector, we 
can ask how the full collection of graphs fills out this space of sub- 
graph frequencies. This turns out to be a subtle issue, because the 
arrangement of the graphs in this space is governed by two distinct 
sets of effects: extremal combinatorial constraints showing that cer- 
tain combinations of subgraph frequencies are genuinely impos- 
sible; and empirical properties, which reveal that the bulk of the 
graphs tend to lie close to a simple one-dimensional curve through 
the space. We formulate results on both these types of properties, 
in the former case building on an expanding body of combinatorial 
theory [4, 17] for bounding the frequencies at which different types
of subgraphs can occur in a larger ambient graph. 

The fact that the space of subgraph frequencies is constrained in 
these multiple ways also allows us to concretely address the follow- 
ing type of question: When we see that human social networks do 
not exhibit a certain type of structure, is that because such a struc- 
ture is mathematically impossible, or simply because human beings 
do not create it when they form social connections? In other words, 
what is a property of graphs and what is a property of people? Al- 
though this question is implicit in many studies of social networks, 
it is hard to separate the two effects without a formal framework 
such as we have here. 

Indeed, our framework offers a direct contribution to one of the 
most well-known observations about social graphs: the tendency of 
social relationships to close triangles, and the relative infrequency 
of what is sometimes called the 'forbidden triad': three people with 
two social relationships between them, but one absent relationship 
[22]. There are many sociological theories for why one would expect
this subgraph to be underrepresented in empirical social networks
[10]. Our framework shows that the frequency of this 'forbidden
triad' has a non-trivial upper bound in not just social graphs,
but in all graphs. Harnessing our framework more generally, we are
in fact able to show that any k-node subgraph that is not a complete
or empty subgraph has a frequency that is bounded away from one. 
Thus, there is an extent to which almost all subgraphs are mathe- 
matically 'forbidden' from occurring beyond a certain frequency. 

We aim to separate these mathematical limits of graphs from the 
complementary empirical properties of real social graphs. The fact 
that real graph collections have a roughly one-dimensional struc- 
ture in our coordinate system leads directly to our first main ques- 
tion: is it possible to succinctly characterize the underlying back- 
bone for this one-dimensional structure, and can we use such a 
characterization to usefully describe graphs within our coordinate 
system in terms of their deviation from this backbone? 

The subgraph frequencies of the standard Erdős-Rényi random
graph [3] $G_{n,p}$ produce a one-dimensional curve (parametrized by
p) that weakly approximates the layout of the real graphs in the
space, but the curve arising from this random graph model systematically
deviates from the real graphs in that the random graph
contains fewer triangles and more triangle-free subgraphs. This observation
is consistent with the sociological principle of triadic closure:
triangles tend to form in social networks. As a means of
closing this deviation from $G_{n,p}$, we develop a tractable stochastic
model of graph generation with a single additional parameter, determining
the relative rates of arbitrary edge formation and triangle-closing
edge formation. The model exhibits rich behaviors, and for
appropriately chosen settings of its single parameter, it produces
remarkably close agreement with the subgraph frequencies observed
in real data for the suite of all possible 3-node and 4-node subgraphs.

Finally, we use this representation to study how different col- 
lections of graphs may differ from one another. This arises as a 
question of basic interest in the analysis of large social media platforms,
where users continuously manage multiple audiences [2],
ranging from their set of friends, to the members of groups
they've joined, to the attendees of events and beyond. Do these audiences
differ from each other at a structural level, and if so what
are the distinguishing characteristics? Using Facebook data, we 
identify structural differences between the graphs induced on net- 
work neighborhoods, groups, and events. The underlying basis for 
these differences suggests corresponding distinctions in each user's 
reaction to these different audiences with whom they interact. 

2. DATA DESCRIPTION 

Throughout our presentation, we analyze several collections of 
graphs collected from Facebook's social network. The collections 
we study are all induced graphs from the Facebook friendship graph, 
which records friendship connections as undirected edges between 
users, and thus all our induced graphs are also undirected. The 
framework we characterize in this work would naturally extend to 
provide insights about directed graphs, an extension we do not dis- 
cuss. We do not include edges formed by Facebook 'subscriptions' 
in our study, nor do we include Facebook 'pages' or connections 
from users to such pages. All Facebook social graph data was ana- 
lyzed in an anonymous, aggregated form. 

For this work, we extracted three different collections of graphs, 
around which we organize our discussion: 

• Neighborhoods: Graphs induced by the friends of a single Face- 
book user ego and the friendship connections among these indi- 
viduals (excluding the ego). 

• Groups: Graphs induced by the members of a 'Facebook group',
a Facebook feature for organizing focused conversations between 
a small or moderate-sized set of users. 

• Events: Graphs induced by the confirmed attendees of 'Face- 
book events', a Facebook feature for coordinating invitations to 
calendar events. Users can respond 'Yes', 'No', or 'Maybe' to
such invitations, and we consider only users who respond 'Yes'. 

The neighborhood and groups collections were assembled in Oc- 
tober 2012 based on monthly active user egos and current groups, 
while the events data was collected from all events during 2010 and 
2011. For event graphs, only friendship edges formed prior to the 
date of the event were used. Subgraph frequencies for four-node 
subgraphs were computed by sampling 11,000 induced subgraphs
uniformly with replacement, providing sufficiently precise frequen- 
cies without enumeration. The graph collections were targeted at a 
variety of different graph sizes, as will be discussed in the text. 
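
The sampling step described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name `sample_four_node_frequencies`, the seed, and the use of the sorted degree sequence as a label are choices made for this example. The degree sequence works as a label because the 11 four-node graphs all have distinct degree sequences.

```python
import random
from collections import Counter
from itertools import combinations

def sample_four_node_frequencies(nodes, edges, num_samples=11000, seed=0):
    """Estimate induced 4-node subgraph frequencies by uniform sampling.

    Each draw is an independent, uniformly random 4-node subset, so the
    same subset may be drawn more than once (sampling with replacement).
    Subgraph types are keyed by the sorted degree sequence of the induced
    subgraph (an assumption of this sketch, valid only for k = 4).
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    edge_set = {frozenset(e) for e in edges}
    counts = Counter()
    for _ in range(num_samples):
        quad = rng.sample(nodes, 4)
        degree = dict.fromkeys(quad, 0)
        for u, v in combinations(quad, 2):
            if frozenset((u, v)) in edge_set:
                degree[u] += 1
                degree[v] += 1
        counts[tuple(sorted(degree.values()))] += 1
    return {key: c / num_samples for key, c in counts.items()}

# Star with center 0 and four leaves: samples are either a 3-leaf star
# (degree sequence (1, 1, 1, 3)) or four isolated leaves ((0, 0, 0, 0)).
star_edges = [(0, leaf) for leaf in range(1, 5)]
print(sample_four_node_frequencies(range(5), star_edges, num_samples=2000))
```

The cost is independent of the size of the graph once the edge set is built, which is what makes sampling attractive compared with enumerating all four-node subsets.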

3. SUBGRAPH SPACE 

In this section, we study the space of subgraph frequencies that 
form the basis of our coordinate system, and the one-dimensional 
concentration of empirical graphs within this coordinate system. 
We derive a model capable of accurately identifying the backbone 
of this empirical concentration using only the basic principle of 
triadic closure, showing how the subgraph frequencies of empirical 
social graphs are seemingly restricted to the vicinity of a simple 
one-dimensional structure. 

Figure 1: Subgraph frequencies for three node subgraphs for graphs of size 50, 100, and 200 (left to right). The neighborhoods are
orange, groups are green, and events are lavender. The black curves illustrate $G_{n,p}$ as a function of p.

Formally, the subgraph frequency of a k-node graph F in an
n-node graph G (where k < n) is the probability that a random
k-node subset of G induces a copy of F. It is clear that for any
integer k, the subgraph frequencies of all the k-node graphs sum
to one, constraining the vector of frequencies to an appropriately
dimensioned simplex. In the case of k = 3, this vector is simply the
relative frequency of induced three-node subgraphs restricted to the
4-simplex; there are just four such subgraphs, with zero, one, two,
and three edges respectively. When considering the frequency of
larger subgraphs, the dimension of the simplex grows very quickly,
and already for k = 4, the space of four-node subgraph frequencies
lives in an 11-simplex.
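
The coordinate counts quoted above (four subgraphs for k = 3, eleven for k = 4) can be checked by brute force. The sketch below is illustrative only; `count_graphs` is a hypothetical helper that canonicalizes each labeled graph by taking the smallest relabeled edge list over all node permutations, which is practical only for very small k.

```python
from itertools import combinations, permutations

def count_graphs(k):
    """Count k-node graphs up to isomorphism by brute force (small k only)."""
    pairs = list(combinations(range(k), 2))
    perms = list(permutations(range(k)))
    canonical_forms = set()
    for mask in range(1 << len(pairs)):
        edges = [pairs[i] for i in range(len(pairs)) if (mask >> i) & 1]
        # Canonical form: lexicographically smallest relabeled edge list.
        best = min(
            tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
            for perm in perms
        )
        canonical_forms.add(best)
    return len(canonical_forms)

print(count_graphs(3), count_graphs(4))  # prints 4 11
```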

Empirical distribution. In Figure 1, the three-node subgraph frequencies
of 50-node, 100-node, and 200-node graph collections are
shown, with each subplot showing a balanced mixture of 17,000 each of
neighborhood, group, and event graphs (the three collections discussed
in Section 2), totaling 51,000 graphs at each size. Because
these frequency vectors are constrained to the 4-simplex, their distribution
can be visualized in $R^3$ with three of the frequencies as
axes.

Notice that these graph collections, induced from disparate con- 
texts, all occupy a sharply concentrated subregion of the unit sim- 
plex. The points in the space have been represented simply as an 
unordered scatterplot, and two striking phenomena already stand 
out: first, the particular concentrated structure within the simplex 
that the points follow; and second, the fact that we can already 
discern a non-uniform distribution of the three contexts (neighbor- 
hoods, groups and events) within the space — that is, the differ- 
ent contexts can already be seen to have different structural loci. 
Notice also that as the size of the graphs increases, from 50 to
100 to 200, the distribution appears to sharpen around the one-dimensional
backbone. The vast number of graphs that we are able
to consider by studying Facebook data is here illuminating a structure
that is simply not discernible in previous examinations of subgraph
frequencies [8], since no analysis has previously considered
a collection near this scale.

The imagery of Figure 1 directly motivates our work, by visually
framing the essence of our investigation: what facets of this curi- 
ous structure derive from our graphs being social graphs, and what 
facets are simply universal properties of all graphs? We will find, 
in particular, that parts of the space of subgraph frequencies are in 
fact inaccessible to graphs for purely combinatorial reasons — it is 
mathematically impossible for one of the points in the scatterplot 
to occupy these parts of the space. But there are other parts of the 
space that are mathematically possible; it is simply that no real so- 
cial graphs appear to be located within them. Intuitively, then, we 
are looking at a population density within an ambient space (the 
Facebook graphs within the space of subgraph frequencies), and 
we would like to understand both the geography of the inhabited 
terrain (what are the properties of the areas where the population
has in fact settled?) and also the properties of the boundaries of the
space as a whole (where, in principle, would it be possible for the 
population to settle?). 

Also in Figure 1 we plot the curve for the frequencies of 3-node
subgraphs in $G_{n,p}$ as a function of p. The curve is given simply
by the probability of obtaining the desired number of edges in a
three-node graph, $((1-p)^3, 3p(1-p)^2, 3p^2(1-p), p^3)$. This
curve closely tracks the empirical density through the space, with
a single notable discrepancy: the real world graphs systematically
contain more triangles when compared to $G_{n,p}$ at the same edge
density. We emphasize that it is not a priori clear why $G_{n,p}$ would
at all be a good model of subgraph frequencies in modestly-sized
dense social graphs such as the neighborhoods, groups, and events
that we have here; we believe the fact that it tracks the data with any
fidelity at all is an interesting issue for future work. Beyond $G_{n,p}$,
in the following subsection, we present a stochastic model of edge
formation and deletion on graphs specifically designed to close the
remaining discrepancy. As such, our model provides a means of
accurately characterizing the backbone of subgraph frequencies for
social graphs.
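
The $G_{n,p}$ curve for three-node subgraphs can be evaluated directly from the binomial expression. The short sketch below uses hypothetical helper names and checks two properties that follow from the formula: the four frequencies sum to one, and the implied edge density equals p.

```python
def gnp_triad_frequencies(p):
    """Frequencies of 3-node subgraphs with 0, 1, 2, 3 edges under G(n, p)."""
    q = 1.0 - p
    return (q ** 3, 3 * p * q ** 2, 3 * p ** 2 * q, p ** 3)

def implied_edge_density(freqs):
    """Average fraction of the three possible edges that are present."""
    return sum(m * f for m, f in enumerate(freqs)) / 3.0

for p in (0.1, 0.3, 0.5):
    freqs = gnp_triad_frequencies(p)
    print(p, freqs, sum(freqs), implied_edge_density(freqs))
```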

Stochastic model of edge formation. The classic Erdős-Rényi
model of random graphs, $G_{n,p}$, produces a distribution over n-node
undirected graphs defined by a simple parameter p, the probability
of each edge independently appearing in the graph. We now introduce
and analyze a related random graph model, the Edge Formation
Random Walk, defined as a random walk over the space of
all unlabeled n-node graphs. In its simplest form, this model is
closely related to $G_{n,p}$, and we will show via detailed balance that
the distribution defined by $G_{n,p}$ on n-node graphs is precisely the
stationary distribution of this simplest version of the random walk
on the space of n-node graphs. We first describe this basic version
of the model; we then add a component to the model that captures a
triadic closure process, which produces a close fit to the properties
we observe in real graphs.

Let $\mathcal{G}_n$ be the space of all unlabeled n-node graphs, and let X(t)
be the following continuous time Markov chain on the state space
$\mathcal{G}_n$. The transition rates between the graphs in $\mathcal{G}_n$ are defined by
random additions and deletions of edges, with all edges having a
uniform formation rate $\gamma > 0$ and a uniform deletion rate $\delta > 0$.
Thus the single parameter $\nu = \gamma/\delta$, the effective formation
rate of edges, completely characterizes the process. Notice that
this process is clearly irreducible, since it is possible to transition
between any two graphs via edge additions and deletions.

Since X(t) is irreducible on a finite state space, it possesses a unique
stationary distribution. The stationary distribution of an irreducible continuous
time Markov chain can be found as the unique stable fixed point of
the linear dynamical system $X'(t) = Q_n(\nu) X(t)$ that describes
the diffusion of probability mass during a random walk on n-node
graphs, where $Q_n(\nu)$ is the generator matrix with transition rates
$q_{ij}$ and $q_{ii} = -\sum_{j \neq i} q_{ji}$, all depending only on $\nu$. The stationary
distribution $\pi_n$ then satisfies $Q_n(\nu)\pi_n = 0$.

Figure 2: The state transitions diagram for our stochastic graph model with k = 4, where $\gamma$ is the arbitrary edge formation rate, $\lambda$
is the triadic closure formation rate, and $\delta$ is the edge elimination rate.

The following proposition shows the clear relationship between
the stationary distribution of this simplest random walk and the
frequencies of $G_{n,p}$.

PROPOSITION 3.1. The probabilities assigned to (unlabeled)
graphs by $G_{n,p}$ satisfy the detailed balance condition for the Edge
Formation Random Walk with edge formation rate $\nu = p/(1-p)$,
and thus characterize the stationary distribution.

PROOF. We first describe an equivalent Markov chain based on
labeled graphs: there is a state for each labeled n-node graph; the
transition rate $q_{ij}$ from a labeled graph $G_i$ to a labeled graph $G_j$
is $q_{ij} = \gamma$ if $G_j$ can be obtained from $G_i$ by adding an edge; and
$q_{ij} = \delta$ if $G_j$ can be obtained from $G_i$ by removing an edge. All
other transition rates are zero. We call this new chain the labeled
chain, and the original chain the unlabeled chain.

Now, suppose there is a transition from unlabeled graph $H_a$ to
unlabeled graph $H_b$ in the unlabeled chain, with transition rate
$k\gamma$. This means that there are k ways to add an edge to a
labeled copy of $H_a$ to produce a graph isomorphic to $H_b$. Now,
let $G_i$ be any graph in the labeled chain that is isomorphic to $H_a$.
In the labeled chain, there are k transitions out of $G_i$ leading to a
graph isomorphic to $H_b$, and each of these has rate $\gamma$. Thus,
with rate $k\gamma$, a transition out of $G_i$ leads to a graph isomorphic
to $H_b$. A strictly analogous argument can be made for edge
deletions, rather than edge additions.

This argument shows that the following describes a Markov chain 
equivalent to the original unlabeled chain: we draw a sequence of 
labeled graphs from the labeled chain, and we output the isomor- 
phism classes of these labeled graphs. Hence, to compute the sta- 
tionary distribution of the original unlabeled chain, which is what 
we seek, we can compute the stationary distribution of the labeled 
chain and then sum stationary probabilities in the labeled chain over 
the isomorphism classes of labeled graphs. 

It thus suffices to verify the detailed balance condition for the
distribution on the labeled chain that assigns probability
$p^{|E(G_i)|} (1-p)^{\binom{n}{2} - |E(G_i)|}$ to each labeled graph $G_i$. Since every transition of
the labeled walk occurs between two labeled graphs $G_i$ and $G_j$,
with $|E(G_i)| = |E(G_j)| + 1$, the only non-trivial detailed balance
equations are of the form:

$$
\begin{aligned}
q_{ij} \Pr[X(t) = G_i] &= q_{ji} \Pr[X(t) = G_j] \\
\Pr[X(t) = G_i] &= \nu \Pr[X(t) = G_j] \\
\Pr[X(t) = G_i] &= \frac{p}{1-p} \Pr[X(t) = G_j].
\end{aligned}
$$

Since the probability assigned to the labeled graph $G_i$ by $G_{n,p}$ is
simply $p^{|E(G_i)|} (1-p)^{\binom{n}{2} - |E(G_i)|}$, detailed balance is clearly
satisfied. □
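
The detailed balance argument in the proof can be checked numerically on the labeled chain for small n. The sketch below is an illustration under stated assumptions: the function name `check_detailed_balance`, the bitmask encoding of labeled graphs, and the choice of the deletion rate as the unit of time are all introduced for this example.

```python
from math import comb, isclose

def check_detailed_balance(n, p, delta=1.0):
    """Check gamma * Pr[G] == delta * Pr[G + e] for all labeled n-node graphs.

    Labeled graphs are encoded as bitmasks over the C(n, 2) possible edges,
    and gamma is chosen so that nu = gamma / delta = p / (1 - p).
    """
    gamma = delta * p / (1.0 - p)
    num_pairs = comb(n, 2)

    def prob(mask):
        m = bin(mask).count("1")  # number of edges present
        return p ** m * (1.0 - p) ** (num_pairs - m)

    for mask in range(1 << num_pairs):
        for e in range(num_pairs):
            if not mask & (1 << e):
                bigger = mask | (1 << e)
                # Flow from adding edge e (rate gamma) must match the
                # flow from deleting it again (rate delta).
                if not isclose(gamma * prob(mask), delta * prob(bigger), rel_tol=1e-9):
                    return False
    return True

print(check_detailed_balance(4, 0.3))
```

The check enumerates all 2^C(n,2) labeled graphs, so it is only feasible for very small n; it is meant as a sanity check of the algebra, not as a substitute for the proof.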



Incorporating triadic closure. The above modeling framework
provides a simple analog of $G_{n,p}$ that readily lends itself to subtle
adjustments. By simply adjusting the transition rates between
select graphs, this framework makes it possible to model random
graphs where certain types of edge formations or deletions have
irregular probabilities of occurring, simply via small perturbations
away from the classic $G_{n,p}$ model. Using this principle, we now
characterize a random graph model that differs from $G_{n,p}$ by a single
parameter, $\lambda$, the rate at which 3-node paths in the graph tend
to form triangles. We call this model the Edge Formation Random
Walk with Triadic Closure.

Again let $\mathcal{G}_n$ be the space of all unlabeled n-node graphs, and let
Y(t) be a continuous time Markov chain on the state space $\mathcal{G}_n$. As
with the ordinary Edge Formation Random Walk, let edges have a
uniform formation rate $\gamma > 0$ and a uniform deletion rate $\delta > 0$,
but now also add a triadic closure formation rate $\lambda > 0$ for every
3-node path that a transition would close. The process is still clearly
irreducible, and the stationary distribution obeys the stationary
conditions $Q_n(\nu, \lambda)\pi_n = 0$, where the generator matrix $Q_n$ now also
depends on $\lambda$. We can express the stationary distribution directly
in the parameters as $\pi_n(\nu, \lambda) = \{\pi : Q_n(\nu, \lambda)\pi = 0\}$. For $\lambda = 0$
the model reduces to the ordinary Edge Formation Random Walk.

The state transitions of this random graph model are easy to construct
for n = 3 and n = 4, and transitions for the case of n = 4
are shown in Figure 2. Proposition 3.1 above tells us that for $\lambda = 0$,
the stationary distribution of a random walk on this state space is
given by the graph frequencies of $G_{n,p}$. As we increase $\lambda$ away
from zero, we should therefore expect to see a stationary distribution
that departs from $G_{n,p}$ precisely by observing more graphs
with triangles and fewer graphs with open triangles.
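
For n = 3 the state space has only four states (zero to three edges), so the effect of the triadic closure rate can be sketched in a few lines. The code below is an illustration under assumptions not stated in the text: time is scaled so that the deletion rate is 1 (hence the formation rate equals $\nu$), the only closing transition is taken to be path-to-triangle at rate $\nu + \lambda$, and the function name `stationary_n3` is hypothetical. Because the chain on edge counts is a birth-death chain, its stationary distribution follows from the ratio of upward to downward rates.

```python
def stationary_n3(nu, lam=0.0):
    """Stationary distribution over 3-node graphs with 0, 1, 2, 3 edges.

    Assumes deletion rate delta = 1, so the formation rate is nu.
    Upward rates: 3*nu (empty -> edge), 2*nu (edge -> path),
    nu + lam (path -> triangle, which closes one 3-node path).
    Downward rates: 1, 2, 3 (one unit of rate per existing edge).
    """
    up = [3.0 * nu, 2.0 * nu, nu + lam]
    down = [1.0, 2.0, 3.0]
    weights = [1.0]
    for u, d in zip(up, down):
        weights.append(weights[-1] * u / d)
    total = sum(weights)
    return [w / total for w in weights]

print(stationary_n3(1.0, 0.0))  # matches G(3, p) with p = nu / (1 + nu)
print(stationary_n3(1.0, 2.0))  # more triangles, fewer open paths
```

Note that this comparison holds $\nu$ fixed rather than the edge density; comparing at equal edge density requires re-solving for $\nu$, as discussed in the fitting procedure.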

The framework of our Edge Formation Random Walk makes it 
possible to model triadic closure precisely; in this sense the model 
forms an interesting contrast with other models of triangle-closing 
in graphs that are very challenging to analyze (e.g. [5, 12, 13, 21,
27]). We will now show how the addition of this single parameter
makes it possible to describe the subgraph frequencies of empirical 
social graphs with remarkable accuracy. 

Fitting subgraph frequencies. The stationary distribution of an
Edge Formation Random Walk model describes the frequency of
different graphs, while the coordinate system we are developing focuses
on the frequency of k-node subgraphs within n-node graphs.
For $G_{n,p}$ these two questions are in fact the same, since the distribution
of random induced k-node subgraphs of $G_{n,p}$ is simply
$G_{k,p}$. When we introduce $\lambda > 0$, however, our model departs from
this symmetry, and the stationary probabilities in a random walk
on k-node graphs are no longer precisely the frequencies of induced
k-node subgraphs in a single n-node graph.

But if we view this as a model for the frequency of small graphs
as objects in themselves, rather than as subgraphs of a larger ambient
graph, the model provides a highly tractable parameterization
that we can use to approximate the structure of subgraph frequencies
observed in our families of larger graphs. In doing so, we aim
to fit $\pi_k(\nu(p, \lambda), \lambda)$ as a function of p, where $\nu(p, \lambda)$ is the rate
parameter $\nu$ that produces edge density p for the specific value of
$\lambda$. For $\lambda = 0$ this relationship is simply $\nu = p/(1-p)$, but for
$\lambda > 0$ the relation is not so tidy, and in practice it is easier to fit $\nu$
numerically rather than evaluate the expression.

Figure 3: Subgraph frequencies for 3-node subgraphs in 50-node
graphs, shown as a function of p. The black curves illustrate
$G_{n,p}$, while the yellow curves illustrate the fit model.

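When considering a collection of graph frequencies we can fit $\lambda$
by minimizing residuals with respect to the model. Given a collection
of N graphs, let $y^1, \ldots, y^N$ be the vectors of k-node subgraph
frequencies for each graph and $p_1, \ldots, p_N$ be the edge densities.
We can then fit $\lambda$ by choosing the value, denoted $\lambda_k^{opt}$, that
minimizes the total residual between the model frequencies
$\pi_k(\nu(p_i, \lambda), \lambda)$ and the observed vectors $y^i$.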
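
The fitting procedure can be sketched end to end for k = 3. Everything in this block is an assumption made for illustration rather than the authors' implementation: the deletion rate is set to 1, the residual is taken to be the sum of squared Euclidean distances, $\nu$ is found by bisection on the edge density, $\lambda$ is chosen from a user-supplied grid, and all function names are hypothetical. The data in the usage example is synthetic, generated from the model itself.

```python
def stationary_n3(nu, lam):
    """Stationary distribution over 3-node graphs with 0..3 edges (delta = 1)."""
    weights = [1.0]
    for up, down in zip((3.0 * nu, 2.0 * nu, nu + lam), (1.0, 2.0, 3.0)):
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    return [w / total for w in weights]

def edge_density(nu, lam):
    """Expected fraction of the three possible edges under the model."""
    return sum(m * w for m, w in enumerate(stationary_n3(nu, lam))) / 3.0

def nu_for_density(p, lam, iterations=200):
    """Bisection for the rate nu whose stationary edge density equals p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while edge_density(hi, lam) < p:  # grow the bracket until it contains p
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if edge_density(mid, lam) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def model_frequencies(p, lam):
    """Model subgraph frequencies at edge density p and closure rate lam."""
    return stationary_n3(nu_for_density(p, lam), lam)

def fit_lambda(ys, ps, candidates):
    """Pick the candidate lam minimizing the summed squared residuals."""
    def loss(lam):
        return sum(
            sum((a - b) ** 2 for a, b in zip(model_frequencies(p, lam), y))
            for y, p in zip(ys, ps)
        )
    return min(candidates, key=loss)

# Synthetic check: data generated with lam = 1.5 should be recovered.
densities = [0.1, 0.3, 0.5]
observed = [model_frequencies(p, 1.5) for p in densities]
print(fit_lambda(observed, densities, [0.0, 0.5, 1.0, 1.5, 2.0]))
```

A grid search is used here only for simplicity; any one-dimensional minimizer could replace it, and a different residual norm would change only the `loss` function.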
In Figure 3 we plot the three-node subgraph frequencies as a
function of edge density p, for a collection of 300,000 50-node
subgraphs, again a balanced mixture of neighborhoods, groups,
and events. In this figure we also plot (in yellow) the curve resulting
from fitting our random walk model with triadic closure,
$\pi_k(\nu(p, \lambda_k^{opt}), \lambda_k^{opt})$, which is thus parameterized as a function of
edge density p. For this mixture of collections and k = 3, the
optimal fit is $\lambda_3^{opt} = 1.61$. Notice how the yellow line deviates
from the black $G_{n,p}$ curve to better represent the backbone of natural
graph frequencies. From the figure it is clear that almost all
graphs have more triangles than a sample from $G_{n,p}$ of corresponding
edge density. When describing extremal bounds in Section 4,
we will discuss how $G_{n,p}$ is in fact by no means the extremal lower
bound.

As suggested by Figure 2, examining the subgraph frequencies
for four-node subgraphs is fully tractable. In Figure 4 we fit $\lambda$ to
the mean subgraph frequencies of our three different collections of
graphs separately. Note that the mean of the subgraph frequencies
over a set of graphs is not necessarily itself a subgraph frequency
corresponding to a graph, but we fit these mean 11-vectors as a
demonstration of the model's ability to fit an 'average' graph. The
subgraph frequency of $G_{n,p}$ at the edge density corresponding to
the data is shown as a black dashed line in each plot, with poor
agreement, and gray dashed lines illustrate an incremental transition
in $\lambda$, starting from zero (when it corresponds to $G_{n,p}$) and
ending at $\lambda^{opt}$.

The striking agreement between the fit model and the mean of
each collection is achieved at the corresponding edge density by
fitting only $\lambda$. For neighborhood graphs, this agreement deviates
measurably on only a single subgraph frequency, the four-node star.
The y-axis is plotted on a logarithmic scale, which makes it rather
remarkable how precisely the model describes the scarcity of the
four-node cycle. The scarcity of squares has been previously observed
in email neighborhoods on Facebook [28], and our model
provides the first intuitive explanation of this scarcity.

Figure 4: The four-node subgraph frequencies for the means
of the 50-node graph collections in Figure 3, and the subgraph
frequency of the model, fitting the triadic closure rate $\lambda$ to the
mean vectors. As $\lambda$ increases from $\lambda = 0$ to $\lambda = \lambda^{opt}$, we see
how this single additional parameter provides a striking fit.

The model's ability to characterize the backbone of the empir- 
ical graph frequencies suggests that the subgraph frequencies of 
individual graphs can be usefully studied as deviations from this 
backbone. In fact, we can interpret the fitting procedure for $\lambda$ as a
variance minimization procedure. Recall that the mean of a set of
points in $R^n$ is the point that minimizes the sum of squared residuals.
In this way, the procedure is in fact fitting the 'mean curve' of
the model distribution to the empirical subgraph frequencies. 

Finally, our model can be used to provide a measure of the triadic
closure strength differentially between graph collections, investigating
the difference in $\lambda^{opt}$ for the subgraph frequencies of
different graph collections. In Figure 4, the three different graph
types resulted in notably different ratios of $\lambda/\nu$ (the ratio of the
triadic closure formation rate to the basic process rate), with a
significantly higher value for this ratio in neighborhoods. We can
interpret this as saying that open triads in neighborhoods are more
prone to triadic closure than open triads in groups or events.

4. EXTREMAL BOUNDS 

As discussed at the beginning of the previous section, we face 
two problems in analyzing the subgraph frequencies of real graphs: 
to characterize the distribution of values we observe in practice, 
and to understand the combinatorial structure of the overall space in
which these empirical subgraph frequencies lie. Having developed
stochastic models to address the former question, we now consider 
the latter question. 

Specifically, in this section we characterize extremal bounds on 
the set of possible subgraph frequencies. Using machinery from the 
theory of graph homomorphisms, we identify fundamental bounds 
on the space of subgraph frequencies that are not properties of so- 
cial graphs, but rather, are universal properties of all graphs. By 
identifying these bounds, we make apparent large tracts of the fea- 
sible region that are theoretically inhabitable but not populated by 
any of the empirical social graphs we examine. 

We first review a body of techniques based in extremal graph 
theory and the theory of graph homomorphisms [17]. We use these
techniques to formulate a set of inequalities on subgraph frequen- 
cies; these inequalities are all linear for a fixed edge density, an 
observation that allows us to cleanly construct a linear program to 
maximize and minimize each subgraph frequency within the com- 
bined constraints. In this manner, we show how it is possible to map 
outer bounds on the geography of all these structural constraints. 
We conclude by offering two basic propositions that transcend all 
edge densities, thus identifying fundamental limits on subgraph fre- 
quencies of all sizes. 

4.1 Background on subgraph frequency and 
homomorphism density 

In this subsection, we review some background arising from the 
theory of graph homomorphisms. We will use this homomorphism 
machinery to develop inequalities governing subgraph frequencies. 
These inequalities allow us to describe the outlines of the space 
underlying Figure 1(a): the first step in understanding which aspects
of the distribution of subgraph frequencies in the simplex are
the result of empirical properties of human social networks, and 
which are the consequences of purely combinatorial constraints. 

Linear constraints on subgraph frequency. Let $s(F, G)$ denote
the subgraph frequency of F in G, as defined in the last section: the
probability that a random $|V(F)|$-node subset of G induces a copy
of F. Note that since $s(F, G)$ is a probability over outcomes, it is
subject to the law of total probability. The law of total probability
for subgraph frequencies takes the following form.

PROPOSITION 4.1. For any graph F and any integer $\ell \ge k$,
where $|V(F)| = k$, the subgraph frequency of F in G, $s(F, G)$,
satisfies the equality

\[ s(F, G) = \sum_{\{H : |V(H)| = \ell\}} s(F, H)\, s(H, G). \]

PROOF. Let $H'$ be a random $\ell$-vertex induced subgraph of G.
Now, the set of outcomes $\mathcal{H} = \{H : |V(H)| = \ell\}$ forms a
partition of the sample space, each outcome having probability
$s(H, G)$. Furthermore, conditional upon an $\ell$-vertex induced
subgraph being isomorphic to H, $s(F, H)$ is the probability that a
random k-vertex induced subgraph of H is isomorphic to F. Summing
$s(F, H)\, s(H, G)$ over all outcomes H gives $s(F, G)$. □

This proposition characterizes an important property of subgraph
frequencies: the vector of subgraph frequencies on k nodes is a
linear function of the vector of subgraph frequencies on
$\ell > k$ nodes. Furthermore, this means that any constraint on the
frequency of a subgraph F will also constrain the frequency of any
subgraph H for which $s(F, H) > 0$ or $s(H, F) > 0$.
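As a sanity check on Proposition 4.1, the following minimal Python sketch computes subgraph frequencies by brute force and compares the 3-node frequencies of a small graph with the values obtained by composing through 4-node frequencies. The function names (`canonical`, `frequencies`, `via_total_probability`) and the example graph are illustrative assumptions, not part of the original work, and the exhaustive enumeration is only practical for very small graphs.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import combinations, permutations


def canonical(nodes, edges):
    """Canonical form of the subgraph induced on `nodes`: the smallest sorted
    edge list over all relabelings of the nodes to 0..k-1."""
    nodes = list(nodes)
    present = [(u, v) for u, v in combinations(nodes, 2)
               if frozenset((u, v)) in edges]
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = dict(zip(nodes, perm))
        form = tuple(sorted(tuple(sorted((relabel[u], relabel[v])))
                            for u, v in present))
        if best is None or form < best:
            best = form
    return best


def frequencies(n, edges, k):
    """Map each k-node graph F (in canonical form) to s(F, G), where G has
    node set range(n) and edge set `edges` (a set of frozensets)."""
    counts = Counter(canonical(S, edges) for S in combinations(range(n), k))
    total = sum(counts.values())
    return {F: Fraction(c, total) for F, c in counts.items()}


def via_total_probability(n, edges, k, ell):
    """Right-hand side of Proposition 4.1: sum over ell-node graphs H of
    s(F, H) * s(H, G)."""
    out = defaultdict(Fraction)
    for h_form, s_HG in frequencies(n, edges, ell).items():
        h_edges = {frozenset(e) for e in h_form}
        for F, s_FH in frequencies(ell, h_edges, k).items():
            out[F] += s_FH * s_HG
    return dict(out)


# Hypothetical 6-node example graph.
N = 6
G_EDGES = {frozenset(e) for e in
           [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (1, 4)]}

direct = frequencies(N, G_EDGES, 3)
composed = via_total_probability(N, G_EDGES, 3, 4)
```

Because exact rational arithmetic is used, `direct` and `composed` should agree exactly rather than approximately.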

Graph homomorphisms. A number of fundamental inequalities
on the occurrence of subgraphs are most naturally formulated in
terms of graph homomorphisms, a notion that is connected to but
distinct from the notion of induced subgraphs. In order to describe
this machinery, we first review some basic definitions [4]. If F and
G are labelled graphs, a map $f : V(F) \to V(G)$ is a homomorphism
if each edge $(v, w)$ of F maps to an edge $(f(v), f(w))$ of G.
We now write $t(F, G)$ for the probability that a random map from
$V(F)$ into $V(G)$ is a homomorphism, and we refer to $t(F, G)$ as the
homomorphism density of F in G.

There are three key differences between the homomorphism den- 
sity t(F, G) and the subgraph frequency s(F, G) defined earlier in 
this section. First, t(F, G) is based on mappings of F into G that 
can be many-to-one — multiple nodes of F can map to the same 
node of G — while s(F, G) is based on one-to-one mappings. Sec- 
ond, t(F, G) is based on mappings of F into G that must map 
edges to edges, but impose no condition on pairs of nodes in F that 
do not form edges: in other words, a homomorphism is allowed to 
map a pair of unlinked nodes in F to an edge of G. This is not the 
case for s(F, G), which is based on maps that require non-edges 
of F to be mapped to non-edges of G. Third, t(F, G) is a fre- 
quency among mappings from labeled graphs F to labelled graphs 
G, while s(F, G) is a frequency among mappings from unlabeled 
F to unlabeled G. 

From these three differences, it is not difficult to write down a
basic relationship governing the functions s and t [4]. To do this,
it is useful to define the intermediate notion $t_{inj}(F, G)$, which is
the probability that a random one-to-one map from $V(F)$ to $V(G)$ is
a homomorphism. Since only an $O(1/|V(G)|)$ fraction of all maps
from $V(F)$ to $V(G)$ are not one-to-one, we have

\[ t(F, G) = t_{inj}(F, G) + O(1/|V(G)|). \tag{1} \]



Next, by definition, a one-to-one map f of F into G is a
homomorphism if and only if the image $f(F)$, when viewed as an induced
subgraph of G, contains all of F's edges and possibly others.
Correcting also for the conversion from labelled to unlabeled graphs,
we have

\[ t_{inj}(F, G) = \sum_{F' : F \subseteq F'} \frac{ext(F, F') \cdot aut(F')}{k!}\, s(F', G), \tag{2} \]

where $k = |V(F)|$, $aut(F')$ is the number of automorphisms of $F'$, and
$ext(F, F')$ is the number of ways that a labelled graph F can be
extended (by adding edges) to form a labelled graph H isomorphic to $F'$.
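To make Equations (1) and (2) concrete, here is a small brute-force sketch for the 3-node path. For this F, Equation (2) reduces to $t_{inj} = \tfrac{1}{3}x_3 + x_4$, where $x_3$ and $x_4$ are the frequencies of the induced path and the triangle: the path extends to itself in one way and has 2 automorphisms, while the triangle has 6. The helper names and the example graph are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import combinations, permutations, product


def hom_density(f_edges, k, n, g_edges, injective=False):
    """Fraction of maps from {0..k-1} to {0..n-1} that send every edge of F
    to an edge of G; restricted to one-to-one maps if `injective` is set."""
    maps = permutations(range(n), k) if injective else product(range(n), repeat=k)
    hits = total = 0
    for f in maps:
        total += 1
        hits += all(frozenset((f[u], f[v])) in g_edges for u, v in f_edges)
    return Fraction(hits, total)


def triad_frequencies(n, g_edges):
    """[x1, x2, x3, x4]: frequencies of induced 3-node subgraphs with 0, 1, 2
    and 3 edges (the edge count determines a 3-node graph up to isomorphism)."""
    counts = [0, 0, 0, 0]
    for S in combinations(range(n), 3):
        counts[sum(frozenset(e) in g_edges for e in combinations(S, 2))] += 1
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


# Hypothetical 6-node example graph.
N = 6
G_EDGES = {frozenset(e) for e in
           [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (1, 4)]}
PATH = [(0, 1), (1, 2)]  # 3-node path with center node 1

x = triad_frequencies(N, G_EDGES)
t_inj = hom_density(PATH, 3, N, G_EDGES, injective=True)
t_all = hom_density(PATH, 3, N, G_EDGES)
gap = abs(t_all - t_inj)  # Equation (1): this gap is O(1/n)
```

The gap between `t_all` and `t_inj` is at most the fraction of maps that are not one-to-one, which for k = 3 is below 3/n.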

Homomorphism inequalities. There are a number of non-trivial 
results bounding the graph homomorphism density, which we now 
review. By translating these to the language of subgraph
frequencies, we can begin to develop bounds on the simplexes in Figure 1.
For complete graphs, the Kruskal-Katona Theorem produces
per bounds on homomorphism density in terms of the edge density 
while the Moon-Moser Theorem provides lower bounds, also in 
terms of the edge density. 

PROPOSITION 4.2 (Kruskal-Katona [17]). For a complete
graph $K_r$ on r nodes and graph G with edge density $t(K_2, G)$,

\[ t(K_r, G) \le t(K_2, G)^{r/2}. \]

PROPOSITION 4.3 (Moon-Moser [19, 24]). For a complete
graph $K_r$ on r nodes and graph G with edge density
$t(K_2, G) \in [(r-2)/(r-1), 1]$,

\[ t(K_r, G) \ge \prod_{i=1}^{r-1} \bigl(1 - i\,(1 - t(K_2, G))\bigr). \]



The Moon-Moser bound is well known to not be sharp, and Razborov
has recently given an impressive sharp lower bound for the
homomorphism density of the triangle $K_3$ [24] using sophisticated
machinery [23]. We limit our discussion to the simpler Moon-Moser
lower bound, which takes the form of a concise polynomial and
provides bounds for arbitrary r, not just the triangle (r = 3).

Finally, we employ a powerful inequality that is known to lower
bound the homomorphism density of any graph F that is either
a forest, an even cycle, or a complete bipartite graph. Stated as
such, these are the solved special cases of the open Sidorenko
Conjecture, which posits that the result could be extended to all
bipartite graphs F. We will use the following proposition in
particular when F is a tree, and will refer to this part of the result
as the Sidorenko tree bound.

PROPOSITION 4.4 (Sidorenko [17, 25]). For a graph F that
is a forest, even cycle, or complete bipartite graph, with edge set
$E(F)$, and G with edge density $t(K_2, G)$,

\[ t(F, G) \ge t(K_2, G)^{|E(F)|}. \]
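The three bounds in Propositions 4.2 to 4.4 are simple closed-form functions of the edge density, as the following sketch illustrates. It evaluates them on the complete bipartite graph $K_{3,3}$, which has homomorphism edge density exactly 1/2, contains no triangles, and is regular, so it meets both the Moon-Moser and the Sidorenko bounds with equality. The function names are illustrative assumptions, and returning the trivial bound 0 outside the density range of Proposition 4.3 is a convention chosen here, not something stated in the text.

```python
from fractions import Fraction
from itertools import product


def hom_density(f_edges, k, n, g_edges):
    """t(F, G): fraction of all maps {0..k-1} -> {0..n-1} that are homomorphisms."""
    hits = sum(all(frozenset((f[u], f[v])) in g_edges for u, v in f_edges)
               for f in product(range(n), repeat=k))
    return Fraction(hits, n ** k)


def kruskal_katona_upper(r, p):
    """Upper bound on t(K_r, G) from Proposition 4.2."""
    return float(p) ** (r / 2)


def moon_moser_lower(r, p):
    """Lower bound on t(K_r, G) from Proposition 4.3. Outside the stated
    density range we fall back to the trivial bound 0 (our own convention)."""
    if p < Fraction(r - 2, r - 1):
        return Fraction(0)
    bound = Fraction(1)
    for i in range(1, r):
        bound *= 1 - i * (1 - p)
    return bound


def sidorenko_lower(num_edges, p):
    """Lower bound on t(F, G) from Proposition 4.4, for F with num_edges edges."""
    return p ** num_edges


# K_{3,3}: nodes 0..2 on one side, 3..5 on the other.
K33 = {frozenset((u, v)) for u in range(3) for v in range(3, 6)}

p = hom_density([(0, 1)], 2, 6, K33)                         # edge density
t_triangle = hom_density([(0, 1), (1, 2), (0, 2)], 3, 6, K33)
t_path = hom_density([(0, 1), (1, 2)], 3, 6, K33)
```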

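Using Equations (1) and (2), we can translate statements about
homomorphisms into asymptotic statements about the combined
frequency of particular sets of subgraphs. We can also translate
statements about frequencies of subgraphs to frequencies of their
complements using the following basic fact, where $\bar{F}$ denotes the
complement of F.

LEMMA 4.5. If for graphs $F_1, \ldots, F_\ell$, coefficients
$\alpha_i \in \mathbb{R}$, and a function f,

\[ \alpha_1 s(F_1, G) + \ldots + \alpha_\ell s(F_\ell, G) \ge f(s(K_2, G)), \quad \forall G, \]

then

\[ \alpha_1 s(\bar{F}_1, G) + \ldots + \alpha_\ell s(\bar{F}_\ell, G) \ge f(1 - s(K_2, G)), \quad \forall G. \]

PROOF. Note that $s(F, G) = s(\bar{F}, \bar{G})$. Thus if

\[ \alpha_1 s(F_1, G) + \ldots + \alpha_\ell s(F_\ell, G) \ge f(s(K_2, G)), \quad \forall G, \]

then, since complementation is a bijection on the set of graphs,

\[ \alpha_1 s(\bar{F}_1, G) + \ldots + \alpha_\ell s(\bar{F}_\ell, G) \ge f(s(\bar{K}_2, G)), \quad \forall G, \]

where $s(\bar{K}_2, G) = 1 - s(K_2, G)$. □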

4.2 An LP for subgraph frequency bounds 

In the previous subsection, we reviewed linear constraints between
the frequencies of subgraphs of different sizes, and upper and lower
bounds on graph homomorphism densities with applications to
subgraph frequencies. We will now use these constraints to assemble a
linear program capable of mapping out bounds on the extremal
geography of the subgraph space we are considering. To do this, we
will maximize and minimize each individual subgraph frequency,
subject to the constraints we have just catalogued.

We will focus our analysis on the cases k = 3, the triad
frequencies, and k = 4, the quad frequencies. Let $x_1, x_2, x_3, x_4$
denote the subgraph frequencies $s(\cdot, G)$ of the four possible 3-vertex
undirected graphs, ordered by increasing edge count.

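PROGRAM 4.6. The frequency $x_i$ of a 3-node subgraph in any
graph G with edge density p is bounded asymptotically (in $|V(G)|$)
by max / min $x_i$ subject to $x_i \ge 0, \forall i$ and:

\[ x_1 + x_2 + x_3 + x_4 = 1, \qquad \tfrac{1}{3} x_2 + \tfrac{2}{3} x_3 + x_4 = p, \tag{3} \]
\[ x_4 \le p^{3/2}, \qquad x_1 \le (1-p)^{3/2}, \tag{4} \]
\[ x_4 \ge p(2p - 1), \qquad p \ge 1/2, \tag{5} \]
\[ x_1 \ge (1 - p)(1 - 2p), \qquad p \le 1/2, \tag{6} \]
\[ \tfrac{1}{3} x_3 + x_4 \ge p^2, \qquad x_1 + \tfrac{1}{3} x_2 \ge (1-p)^2. \tag{7} \]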

Here the equalities in (3) derive from the linear constraints, the
constraints in (4) derive from Kruskal-Katona, the constraints (5)
and (6) derive from Moon-Moser, and the constraints in (7) derive from
the Sidorenko tree bound. More generally, we obtain the following
general linear program that can be used to find nontrivial bounds
for any subgraph frequency:

PROGRAM 4.7. The frequency $f_F$ of a k-node subgraph F in
any graph G with edge density p is bounded asymptotically (in
$|V(G)|$) by max / min $f_F$, subject to $A f = b(p)$, $C f \le d(p)$,
appropriately assembled, where f is the vector of k-node subgraph
frequencies.
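For the triad case, the program is small enough to solve without an LP library. The sketch below is one possible illustration, not the authors' implementation: it uses the two equalities in (3) to eliminate $x_1$ and $x_2$, which leaves a bounded polygon in the $(x_3, x_4)$ plane, and then enumerates the polygon's vertices, where a linear objective attains its extremes. The names `triad_constraints` and `triad_bounds` are hypothetical, and any standard LP solver could be substituted for the vertex enumeration.

```python
from itertools import combinations


def triad_constraints(p):
    """Inequalities of Program 4.6 as (a, b), meaning a . x <= b,
    with x = (x1, x2, x3, x4)."""
    cons = [((-1, 0, 0, 0), 0.0), ((0, -1, 0, 0), 0.0),
            ((0, 0, -1, 0), 0.0), ((0, 0, 0, -1), 0.0),       # x_i >= 0
            ((0, 0, 0, 1), p ** 1.5),                          # (4) Kruskal-Katona
            ((1, 0, 0, 0), (1 - p) ** 1.5),                    # (4) complement
            ((0, 0, -1 / 3, -1), -p ** 2),                     # (7) Sidorenko
            ((-1, -1 / 3, 0, 0), -(1 - p) ** 2)]               # (7) complement
    if p >= 0.5:
        cons.append(((0, 0, 0, -1), -p * (2 * p - 1)))         # (5) Moon-Moser
    if p <= 0.5:
        cons.append(((-1, 0, 0, 0), -(1 - p) * (1 - 2 * p)))   # (6) complement
    return cons


def triad_bounds(p, tol=1e-9):
    """Return [(min x_i, max x_i) for i = 1..4] at edge density p."""
    # The equalities (3) give x1 = 1 - 3p + x3 + 2 x4 and x2 = 3p - 2 x3 - 3 x4.
    def lift(x3, x4):
        return (1 - 3 * p + x3 + 2 * x4, 3 * p - 2 * x3 - 3 * x4, x3, x4)

    # Rewrite each inequality in terms of (x3, x4): u*x3 + v*x4 <= w.
    reduced = [(a3 + a1 - 2 * a2,
                a4 + 2 * a1 - 3 * a2,
                b - a1 * (1 - 3 * p) - a2 * 3 * p)
               for (a1, a2, a3, a4), b in triad_constraints(p)]

    vertices = []
    for (a, b, c), (d, e, f) in combinations(reduced, 2):
        det = a * e - b * d
        if abs(det) < tol:
            continue  # parallel constraint lines never meet
        x3, x4 = (c * e - b * f) / det, (a * f - c * d) / det  # Cramer's rule
        if all(u * x3 + v * x4 <= w + tol for u, v, w in reduced):
            vertices.append(lift(x3, x4))
    return [(min(v[i] for v in vertices), max(v[i] for v in vertices))
            for i in range(4)]


bounds_half = triad_bounds(0.5)
bounds_three_quarters = triad_bounds(0.75)
```

At p = 1/2 the upper bound on $x_3$ returned by this sketch should be 3/4, in line with Proposition 4.8.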

From Program 4.6 given above it is possible to derive a simple
upper bound on the frequency of the 3-node path (sometimes
described in the social networks literature as the "forbidden triad",
as mentioned earlier).

PROPOSITION 4.8. The subgraph frequency of the 3-node path
F obeys $s(F, G) \le 3/4 + o(1), \forall G$.

PROOF. Let $x_1, x_2, x_3, x_4$ again denote the subgraph
frequencies $s(\cdot, G)$ of the four possible 3-vertex undirected graphs,
ordered by increasing edge count, where $x_3$ is the frequency of the
3-node path. By the linear constraints,

\[ \tfrac{1}{3} x_2 + \tfrac{2}{3} x_3 + x_4 = p, \]

while by Moon-Moser, $x_4 + O(1/|V(G)|) \ge p(2p - 1)$. Combining
these two constraints we have:

\[ x_3 \le 3p(1 - p) + o(1). \]

The polynomial in p is maximized at p = 1/2, giving an upper
bound of 3/4 + o(1). □
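The combination step in the proof can be spelled out as follows. This is our own expansion of the argument, under the assumption that the Moon-Moser constraint is applied for $p \ge 1/2$ and the trivial bound $x_4 \ge 0$ for $p < 1/2$.

```latex
\begin{align*}
\tfrac{2}{3}\,x_3 &= p - x_4 - \tfrac{1}{3}\,x_2 \;\le\; p - x_4
  && \text{(linear constraint, } x_2 \ge 0\text{)} \\
p \ge \tfrac{1}{2}:\quad
x_3 &\le \tfrac{3}{2}\bigl(p - p(2p-1)\bigr) + o(1) = 3p(1-p) + o(1)
  && \text{(Moon-Moser)} \\
p < \tfrac{1}{2}:\quad
x_3 &\le \tfrac{3}{2}\,p \;\le\; 3p(1-p)
  && \text{(since } x_4 \ge 0 \text{ and } 1 - p \ge \tfrac{1}{2}\text{)}
\end{align*}
```

In both regimes $x_3 \le 3p(1-p) + o(1) \le 3/4 + o(1)$.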

This bound on the "forbidden triad" is immediately apparent from
Figure 5 as well, which shows the bounds constructed via linear
programs for all 3-node and 4-node subgraph frequencies. In fact,
the subgraph frequency of the "forbidden" 3-node path in the
balanced complete bipartite graph $K_{n/2,n/2}$, which has asymptotic
edge density p = 1/2, tends to $s(F, G) = 3/4$ as n grows,
demonstrating that this bound is asymptotically tight. (In fact, we
can perform a more careful analysis showing that it is exactly tight
for even n.)
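The approach to 3/4 can be checked directly. In $K_{n/2,n/2}$ every triple of nodes either lies on one side (no edges) or splits two-and-one (an induced path), so a short count gives the closed form $3n / (4(n-1))$ for even n. That closed form is derived here for illustration and is not stated in the text; the sketch below compares it against brute-force enumeration.

```python
from fractions import Fraction
from itertools import combinations


def path_frequency_bipartite(n):
    """s(F, K_{n/2,n/2}) for the 3-node path F, by enumerating all node
    triples. Nodes 0..n/2-1 form one side, the remaining nodes the other."""
    half = n // 2
    paths = total = 0
    for S in combinations(range(n), 3):
        # Two nodes are adjacent exactly when they lie on different sides.
        edges = sum((u < half) != (v < half) for u, v in combinations(S, 2))
        paths += (edges == 2)
        total += 1
    return Fraction(paths, total)


def closed_form(n):
    """Derived value 3n / (4(n - 1)) for even n; tends to 3/4 as n grows."""
    return Fraction(3 * n, 4 * (n - 1))


values = {n: path_frequency_bipartite(n) for n in (4, 6, 8, 20)}
```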

Figure 5 illustrates these bounds for k = 3 and k = 4. Notice
that our empirical distributions of subgraph frequencies fall well
within these bounds, leaving large tracts of the bounded area
uninhabited by any observed dense social graph. While the bounds do
not fully characterize the feasible region of subgraph frequencies,
the fact that the bound is asymptotically tight at p = 1/2 for the
complete bipartite graph $K_{n/2,n/2}$ is important: practically no
empirical social graphs come close to the boundary, despite this
evidence that it is feasibly approachable. We emphasize that an exact
characterization of the feasible space would necessitate machinery
at least as sophisticated as that used by Razborov.

In the next subsection we develop two more general observa- 
tions about the subgraph frequencies of arbitrary graphs, the latter 
of which illustrates that, with the exception of clique subgraphs and 
empty subgraphs, it is always possible to be free from a subgraph. 
This shows that the lower regions of the non-clique non-empty fre- 
quency bounds in Figure [5] are always inhabitable, despite the fact 
that social graphs do not empirically populate these regions. 

4.3 Bounding frequencies of arbitrary subgraphs 

The upper bound for the frequency of the 3-node path given in
Proposition 4.8 amounted to simply combining appropriate upper
bounds for different regions of possible edge densities p. In this
section, we provide two general bounds pertaining to the subgraph
frequency of an arbitrary subgraph F. First, we show that any
subgraph that is not a clique and is not empty must have a subgraph
frequency bounded strictly away from one. Second, we show that for
every subgraph F that is not a clique and not empty, it is always
possible to construct a family of graphs with any specified
asymptotic edge density p that contains no induced copies of F.

Figure 5: Subgraph frequencies for 3-node and 4-node subgraphs as function of edge density p. The light green regions denote the
asymptotically feasible region found via the linear program. The empirical frequencies are as in Figure 3. The black curves illustrate
$G_{n,p}$, while the yellow curves illustrate the fit triadic closure model.

With regard to Figure 5, the first of the results in this subsection
uses the Sidorenko tree bound to show that in fact no subgraph 
other than the clique or the empty graph, not even for large values 
of k, has a feasible region that can reach a frequency of 1 — o(l). 
The second statement demonstrates that it is always possible to be 
free of any subgraph that is not a clique or an empty graph, even if 
this does not occur in the real social graphs we observe. 

PROPOSITION 4.9. For every k, there exist constants $\epsilon$ and $n_0$
such that the following holds. If F is a k-node subgraph that is not
a clique and not empty, and G is any graph on $n \ge n_0$ nodes, then
$s(F, G) < 1 - \epsilon$.

PROOF. Let $S_k$ denote the k-node star; in other words, the tree
consisting of a single node linked to k - 1 leaves. By Equation (1),
if G has n nodes, then $t_{inj}(S_k, G) \ge t(S_k, G) - c/n$ for an
absolute constant c. We now state our condition on $\epsilon$ and $n_0$ in
the statement of the proposition: we choose $\epsilon$ small enough and
$n_0$ large enough so that

\[ \frac{(1 - \epsilon)^k}{2 \binom{k}{2}^k} > \max\left(\epsilon, \frac{c}{n_0}\right). \tag{8} \]






For a k-node graph F, let $\mathcal{P}(F)$ denote the property that for
all graphs G on at least $n_0$ nodes, we have $s(F, G) < 1 - \epsilon$. Our
goal is to show that $\mathcal{P}(F)$ holds for all k-node F that are
neither the clique nor the empty graph. We observe that since
$s(F, G) = s(\bar{F}, \bar{G})$, the property $\mathcal{P}(F)$ holds if and
only if $\mathcal{P}(\bar{F})$ holds.

The basic idea of the proof is to consider any k-node graph F
that is neither complete nor empty, and to argue that the star $S_k$
lacks a one-to-one homomorphism into at least one of F or $\bar{F}$;
suppose it is F. The Sidorenko tree bound says that $S_k$ must have a
non-trivial number of one-to-one homomorphisms into G; but the
images of these homomorphisms must be places where F is not
found as an induced subgraph, and this puts an upper bound on the
frequency of F.

We now describe this argument in more detail; we start by
considering any specific k-node graph F that is neither a clique nor
an empty graph. We first claim that there cannot be a one-to-one
homomorphism from $S_k$ into both of F and $\bar{F}$. For if there is a
one-to-one homomorphism from $S_k$ into F, then F must contain a
node of degree k - 1; this node would then be isolated in $\bar{F}$, and
hence there would be no one-to-one homomorphism from $S_k$ into
$\bar{F}$. Now, since it is enough to prove that just one of
$\mathcal{P}(F)$ or $\mathcal{P}(\bar{F})$ holds, we choose one of F or
$\bar{F}$ for which there is no one-to-one homomorphism from $S_k$.
Renaming if necessary, let us assume it is F.

Suppose by way of contradiction that $s(F, G) > 1 - \epsilon$. Let q
denote the edge density of F; that is, $q = |E(F)| / \binom{k}{2}$. The
edge density p of G can be written, using Proposition 4.1, as

\[ p = s(K_2, G) = \sum_{\{H : |V(H)| = k\}} s(K_2, H)\, s(H, G) \ge s(K_2, F)\, s(F, G) \ge q(1 - \epsilon). \]

By a k-set of G, we mean a set of k nodes in G. We color the
k-sets of G according to the following rule. Let U be a k-set of
G: we color U blue if $G[U]$ is isomorphic to F, and we color U
red if there is a one-to-one homomorphism from $S_k$ to $G[U]$. We
leave the k-set uncolored if it is neither blue nor red under these
rules. We observe that no k-set U can be colored both blue and
red, for if it is blue, then $G[U]$ is isomorphic to F, and hence there
is no one-to-one homomorphism from $S_k$ into $G[U]$. Also, note
that $s(F, G) > 1 - \epsilon$ is equivalent to saying that more than a
$(1 - \epsilon)$ fraction of all k-sets are blue.

Finally, what fraction of k-sets are red? By the Sidorenko tree
bound, we have

\[ t(S_k, G) \ge p^{k-1} \ge q^k (1 - \epsilon)^k \ge \frac{(1 - \epsilon)^k}{\binom{k}{2}^k}, \]

where the last inequality follows from the fact that F is not the
empty graph, and hence $q \ge 1/\binom{k}{2}$. Since
$t_{inj}(S_k, G) \ge t(S_k, G) - c/n$, our condition on n from (8)
implies that

\[ t_{inj}(S_k, G) \ge \frac{(1 - \epsilon)^k}{2 \binom{k}{2}^k} > \epsilon. \]

Now, let $inj(S_k, G)$ denote the number of one-to-one
homomorphisms of $S_k$ into G; by definition,

\[ t_{inj}(S_k, G) = \frac{inj(S_k, G)}{n(n-1)\cdots(n-k+1)} = \frac{inj(S_k, G)}{k! \binom{n}{k}}, \]

and hence

\[ inj(S_k, G) = k! \binom{n}{k}\, t_{inj}(S_k, G) > \epsilon\, k! \binom{n}{k}. \]

Now, at most k! different one-to-one homomorphisms can map $S_k$
to the same k-set of G, and hence more than $\epsilon \binom{n}{k}$ many
k-sets of G are red. It follows that the fraction of k-sets that are
red is $> \epsilon$; but this contradicts our assumption that at least a
$(1 - \epsilon)$ fraction of k-sets are blue, since no k-set can be both
blue and red. □

PROPOSITION 4.10. Assume F is not a clique and not empty.
Then for each edge density p there exists a sequence $G_1^p, G_2^p, \ldots$
of asymptotic edge density p for which F does not appear as an
induced subgraph in any $G_i^p$. Equivalently, $s(F, G_i^p) = 0, \forall i$.

PROOF. We call H a near-clique if it has at most one connected
component of size greater than one, and this component is a clique.
For any $p \in [0, 1]$, it is possible to construct an infinite sequence
$H_1^p, H_2^p, \ldots$ of near-cliques with asymptotic density p, by simply
taking the non-trivial component of each $H_i^p$ to be a clique of the
appropriate size.

Now, fix any $p \in [0, 1]$, and let F be any graph that is
neither a clique nor an empty graph. If F is not a near-clique, then
the required sequence $G_1^p, G_2^p, \ldots$ is the sequence of near-cliques
$H_1^p, H_2^p, \ldots$, since all the induced subgraphs of a near-clique are
themselves near-cliques.

On the other hand, if F is a near-clique, then since F is neither
a clique nor an empty graph, the complement of F is not a
near-clique. It follows that the required sequence $G_1^p, G_2^p, \ldots$ is
the sequence of complements of the near-cliques
$H_1^{1-p}, H_2^{1-p}, \ldots$. □
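The construction in this proof is easy to reproduce for triads. The sketch below builds a near-clique with density close to a target p and checks that it contains no induced 3-node path (which is not a near-clique), and that the complement of a near-clique of density close to 1 - p contains no induced copy of the single-edge triad (which is a near-clique). The helper names and the choice n = 20 are assumptions for illustration.

```python
from itertools import combinations
from math import comb


def near_clique(n, p):
    """Edge set on nodes 0..n-1: a clique on the first m nodes, all other
    nodes isolated, with m chosen so that C(m, 2) / C(n, 2) is closest to p."""
    m = min(range(n + 1), key=lambda m: abs(comb(m, 2) - p * comb(n, 2)))
    return {frozenset(e) for e in combinations(range(m), 2)}


def complement(n, edges):
    """Edge set of the complement graph on nodes 0..n-1."""
    return {frozenset(e) for e in combinations(range(n), 2)} - edges


def triad_counts(n, edges):
    """Number of node triples inducing 0, 1, 2 and 3 edges."""
    counts = [0, 0, 0, 0]
    for S in combinations(range(n), 3):
        counts[sum(frozenset(e) in edges for e in combinations(S, 2))] += 1
    return counts


N, P = 20, 0.3
path_free = near_clique(N, P)                            # no induced 3-node path
one_edge_free = complement(N, near_clique(N, 1 - P))     # no single-edge triad

density_a = len(path_free) / comb(N, 2)
density_b = len(one_edge_free) / comb(N, 2)
```

Both graphs have density close to 0.3 at n = 20, and the approximation improves as n grows.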

Note that it is possible to take an F-free graph with asymptotic
density p and append nodes with local edge density p and random
(Erdos-Renyi) connections to obtain a graph with any intermediate
subgraph frequency between zero and that of $G_{n,p}$. The same
blending argument can be applied to any graph with a subgraph
frequency above $G_{n,p}$ to again find graphs with intermediate
subgraph frequencies. In this way we see that large tracts of the
subgraph frequency simplex are fully feasible for arbitrary graphs, yet
by Figure 5 are clearly not inhabited by any real world social graph.

5. CLASSIFICATION OF AUDIENCES 

The previous two sections characterize empirical and extremal 
properties of the space of subgraph frequencies, providing two com- 
plementary frameworks for understanding the structure of social 
graphs. In this section, we conclude our work with a demonstra- 
tion of how subgraph frequencies can also provide a useful tool for 
distinguishing between different categories of graphs. The Edge 
Formation Random Walk model introduced in Section 3 figures
notably, providing a meaningful baseline for constructing
classification features, contributing to the best overall classification
accuracy we are able to produce.

Thus, concretely our classification task is to take a social graph 
and determine whether it is a node neighborhood, the set of peo- 
ple in a group, or the set of people at an event. This is a specific 
version of a broader characterization problem that arises generally 
in social media — namely how social audiences differ in terms of 
social graph structure [1]. Each of the three graph types we
discuss (neighborhoods, groups, and events) defines an audience
with which a user may choose to converse. The defining feature 
of such audience decisions has typically been their size — as users 
choose to share something online, do they want to share it pub- 
licly, with their friends, or with a select subgroup of their friends? 
Products such as Facebook groups exist in part to address this audi- 
ence problem, enabling the creation of small conversation circles. 
Our classification task is essentially asking: do audiences differ in 
meaningful structural ways other than just size? 

In Figure 1 and subsequently in Figure 5 we saw how the three
types of graphs that we study (neighborhoods, groups, and events)
are noticeably clustered around different structural foci in the
space of subgraph frequencies. Figure 5 focused on graphs
consisting of exactly 50 nodes, where it is visibly apparent that both
neighborhoods and events tend to have a lower edge density than
groups of that size. Neighborhood edge density, which is equivalent to
the local clustering coefficient, is known to generally decrease
with graph size [20, 29], but it is not clear that all three of the graph
types we consider here should decrease at the same rate.

In Figure [6] we see that in fact the three graph types do not de- 
crease uniformly, with the average edge density of neighborhoods 
decreasing more slowly than groups or events. Thus, small groups 
are denser than neighborhoods while large groups are sparser, with 
the transition occurring at around 400 nodes. Similarly, small event 
graphs are denser than neighborhoods while large events are much 
sparser, with the transition occurring already at around 75 nodes. 

The two crossing points in Figure 6 suggest a curious challenge:
are there structural features of audience graphs that distinguish them
from each other even when they exhibit the same edge density?
Here we use the language of subgraph frequencies to formulate a
classification task for classifying audience graphs based on
subgraph frequencies. We compare our classification accuracy to the
accuracy achieved when also considering a generous vector of much
more sophisticated graph features. We approach this classification
task using a simple logistic regression model. While more
advanced machine learning models capable of learning richer
relationships would likely produce better classification accuracies,
our goal here is to establish that this vocabulary of features based
on subgraph frequencies can produce non-trivial classification
results even in conjunction with simple techniques such as logistic
regression. Evaluating our features in other contexts such as graph
matching [14, 16, 30], where frequencies of connected subgraphs
have been used previously [25], would be interesting future work.

Figure 6: Edge densities of neighborhoods, groups, and events
as a function of size, n. When n < 400, groups are denser
than neighborhoods. When n < 75, events are denser than
neighborhoods.
When considering neighborhood graphs, recall that we are not
including the ego of the neighborhoods as part of the graph, while
for groups and events the administrators are included as members of
their graphs. As such, neighborhoods without their ego deviate
systematically from analogous audience graphs created as groups or as
events. In Figure 6 we also show the average edge density of
neighborhoods with their ego, adding one node and n - 1 edges, noting
that the difference is small for larger graphs.

Classification features. Subgraph frequencies have been the
motivating coordinate system for the present work, and will serve as
our main feature set. Employing the Edge Formation Random Walk
model from Section 3, we additionally describe each graph by its
residuals with respect to a backbone, described by the parameter
$\lambda$, fit to the complete unclassified training set.

Features based on subgraph frequencies are local features, com- 
putable by examining only a few local nodes of the graph at a 
time. Note that the subgraph frequencies of arbitrarily large graphs 
can be accurately approximated by sampling a small number of in- 
duced graphs. Comparatively, it is relevant to ask: can these simple 
local features do as well as more sophisticated global graph fea- 
tures? Perhaps the number of connected components, the size of 
the largest component, or other global features provide highly in- 
formative features for graph classification. 

To answer this question, we compare our classification accu- 
racy using subgraph frequencies with the accuracy we are able to 
achieve using a set of global graph features. We consider: 

• Size of the k largest components, for k = 1,2. 

• Size of the k-core, for k = 0, 1, 2, 3.

• Number of components in the k-core, for k = 0, 1, 2,

• Degeneracy, the largest k for which the k-core is non-empty.

• Size of the k-brace [28], for k = 1, 2, 3.

• Number of components in the k-brace, for k = 1, 2, 3.

These features combine linearly to produce a rich set of graph
properties. For example, the number of components in the 0-core
minus the number of components in the 1-core yields the number
of singletons in the graph.
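As an illustration of how these core-based features relate, the following sketch computes k-cores by repeatedly peeling low-degree nodes, counts components, and recovers the number of singletons from the difference between the 0-core and 1-core component counts. It is a simple quadratic-time version written for clarity; the function names and the toy graph are assumptions, and the k-brace features are not covered.

```python
def k_core(adj, k):
    """Nodes of the k-core: repeatedly delete nodes that have fewer than k
    neighbors among the nodes still remaining."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.discard(v)
                changed = True
    return alive


def num_components(adj, nodes):
    """Number of connected components of the subgraph induced on `nodes`."""
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if u in nodes and u not in seen:
                    seen.add(u)
                    stack.append(u)
    return count


def degeneracy(adj):
    """Largest k for which the k-core is non-empty."""
    k = 0
    while k_core(adj, k + 1):
        k += 1
    return k


# Toy graph: a triangle 0-1-2 with a pendant node 3, plus singletons 4 and 5.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set(), 5: set()}

singletons = (num_components(ADJ, k_core(ADJ, 0))
              - num_components(ADJ, k_core(ADJ, 1)))
```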

Classification results. The results of the classification model are
shown in Table 1, reported in terms of classification accuracy (the
fraction of correct classifications on the test data), measured
using five-fold cross-validation on a balanced set of 10,000
instances. The classification tasks were chosen to thwart
classification based solely on edge density, which indeed performs poorly.
Using only 4-node subgraph frequencies and residuals, an accuracy
of 77% is achieved in both tasks.

Features                       Accuracy (one column per task)
...                            ...      0.673
Triads + R_λ                   0.736    0.668
Quads                          0.751    0.755
Quads + R_G                    0.765    0.769
Quads + R_λ                    0.765    0.769
Global + Edges                 0.694    0.763
Global + Triads                0.785    0.766
Global + Triads + R_G          0.784    0.766
Global + Triads + R_λ          0.789    0.767
Global + Quads                 0.797    0.812
Global + Quads + R_G           0.807    0.815
Global + Quads + R_λ           0.809    0.820

Table 1: Classification accuracy for N(eighborhoods),
G(roups), and E(vents) on different sets of features. R_G and
R_λ denote the residuals with respect to a $G_{n,p}$ and stochastic
graph model baseline, as described in the text.

In comparison, classification based on a set of global graph
features performed worse, achieving just 69% and 76% accuracy for
the two tasks. Meanwhile, combining global and subgraph
frequency features performed best of all, with a classification
accuracy of 81-82%. In each case we also report the accuracy with
and without residuals as features. Incorporating residuals with
respect to either a $G_{n,p}$ or Edge Formation Random Walk baseline
consistently improved classification, and examining residuals with
respect to either baseline clearly provides a useful orientation of the
subgraph coordinate system for empirical graphs.

6. CONCLUSION 

The modern study of social graphs has primarily focused on 
the examination of the sparse large-scale structure of human re- 
lationships. This global perspective has led to fruitful theoretical 
frameworks for the study of many networked domains, notably the 
world wide web, computer networks, and biological ecosystems 
[20]. However, in this work we argue that the locally dense
structure of social graphs admits an additional framework for analyzing
the structure of social graphs.

In this work, we examine the structure of social graphs through
the coordinate system of subgraph frequencies, developing two
complementary frameworks that allow us to identify both 'social'
structure and 'graph' structure. The framework developed in Section 3
enables us to characterize the apparent social forces guiding graph
formation, while the framework developed in Section 4
characterizes fundamental limits of all graphs, delivered through
combinatorial constraints. Our coordinate system and frameworks are not
only useful for developing intuition, but we also demonstrate how
they can be used to accurately classify graph types using only these
simple descriptions in terms of subgraph frequency.

Distribution note. Implementations of the Edge Formation Ran- 
dom Walk equilibrium solver and the subgraph frequency bounds 
optimization program are available from the first author's webpage. 